Detecting Spam Content in Web Corpora

نویسندگان

  • Vít Baisa
  • Vit Suchomel
چکیده

To increase the search result rank of a website, many fake websites full of generated or semigenerated texts have been made in last years. Since we do not want this garbage in our text corpora, this is a becoming problem. This paper describes generated texts observed in the recently crawled web corpora and proposes a new way to detect such unwanted contents. The main idea of the presented approach is based on comparing frequencies of n-grams of words from the potentially forged texts with n-grams of words from a trusted corpus. As a source of spam text, fake webpages concerning loans from an English web corpus as an example of data aimed to fool search engines were used. The results show this approach is able to detect properly certain kind of forged texts with accuracy reaching almost 70 %.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

A Perspective of Evolution After Five Years: A Large-Scale Study of Web Spam Evolution

Identifying and detecting web spam is an ongoing battle between spam-researchers and spammers which has been going on since search engines allowed searching of web pages to the modern sharing of web links via social networks. A common challenge faced by spam-researchers is the fact that new techniques depend on requiring a corpus of legitimate and spam web pages. Although large corpora of legit...

متن کامل

Detecting Content Spam on the Web through Text Diversity Analysis

Web spam is considered to be one of the greatest threats to modern search engines. Spammers use a wide range of content generation techniques known as content spam to fill search results with low quality pages. We argue that content spam must be tackled using a wide range of content quality features. In this paper we propose a set of content diversity features based on frequency rank distributi...

متن کامل

A structural, content-similarity measure for detecting spam documents on the web

Purpose The Web provides its users with abundant information. Unfortunately, when a Web search is performed, both users and search engines must deal with an annoying problem: the presence of spam documents that are ranked among legitimate ones. The mixed results downgrade the performance of search engines and frustrate users who are required to filter out useless information. To improve the qua...

متن کامل

Analysis and Detection of Web Spam by Means of Web Content

Web Spam is one of the main difficulties that crawlers have to overcome. According to Gyöngyi and Garcia-Molina it is defined as “any deliberate human action that is meant to trigger an unjustifiably favourable relevance or importance of some web pages considering the pages’ true value”. There are several studies on characterising and detecting Web Spam pages. However, none of them deals with a...

متن کامل

DSpin: Detecting Automatically Spun Content on the Web

Web spam is an abusive search engine optimization technique that artificially boosts the search result rank of pages promoted in the spam content. A popular form of Web spam today relies upon automated spinning to avoid duplicate detection. Spinning replaces words or phrases in an input article to create new versions with vaguely similar meaning but sufficiently different appearance to avoid pl...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2012